Natural Language Processing on Heroku
Notes on turning a natural language processing algorithm that I experimented with locally into an API server on Heroku.
(Whether Heroku is the appropriate place for this is not considered here → API server placement considerations)
I'll refer to my notes, Heroku+Flask, from when I did something similar before.
Create a new working directory
I considered using an existing repository, but it is so complicated that tracking down the cause of any problem would take a long time and be unproductive, so I'll keep things as simple as possible.
$ mkdir regroup-split-server
$ cd regroup-split-server
Create a virtual environment and work in VSCode
$ python3 -m venv venv
$ code .
View -> Terminal
$ source venv/bin/activate
Create a minimal server with Flask
$ mkdir server
$ code server/__init__.py
code:python
from flask import Flask

app = Flask(__name__)

def create_app():
    return app

@app.route('/')
def root():
    return "OK"
$ pip install --upgrade pip
$ pip install flask
Set environment variables in a file
$ code .env
code:.env
FLASK_APP=server
FLASK_ENV=development
$ pip install python-dotenv
$ flask run
Verify that it runs without problems and that you get an OK when you open http://127.0.0.1:5000/
$ git init
$ code .gitignore
code:.gitignore
venv/
*.pyc
__pycache__/
$ git add .
$ git commit -m 'minimal Flask server'
Actually, I'm doing Cmd+Enter in the Source Control tab of VSCode.
Add gunicorn and deploy.
Deploying to Heroku also takes care of HTTPS for the Flask app. I include this in the minimal configuration because I don't think a modern API server can be HTTP-only.
$ pip install gunicorn
$ pip freeze > requirements.txt
$ code Procfile
code:Procfile
web: gunicorn server:"create_app()"
$ heroku create regroup-split-server
$ git add .
$ git commit -m "add gunicorn"
$ git push --set-upstream heroku master
The build log appears. Make sure there are no errors.
$ heroku open
Open the deployed app in a browser and make sure OK is displayed.
I haven't figured out yet how to organize this deployment repository and the local R&D repository so that both stay clean.
I'm sure I'll want to separate them at some point, but until I have a clearer idea of how to separate them, I'll keep them together.
I've linked separate repositories with hard links before, but I don't think that is a good idea.
I guess it's better to connect them with a git submodule or pip.
Resolving Application Dependencies with Git Submodules | Heroku Dev Center
PIP for myself
Keep folders separate for easy separation in the future.
$ mkdir server/regroup_split
Copy files that look necessary
code:deploy.sh
cp rich_tokenizer.py ../regroup-split-server/server/regroup_split/
cp regroup_split.py ../regroup-split-server/server/regroup_split/
cp TAIL_TOKENS_TO_REMOVE.txt ../regroup-split-server/server/regroup_split/
cp HEAD_TOKENS_TO_REMOVE.txt ../regroup-split-server/server/regroup_split/
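# Create the test directory first; otherwise cp would write a file named "test".
mkdir -p ../regroup-split-server/server/regroup_split/test
cp test/simplelines1.txt ../regroup-split-server/server/regroup_split/test/
cp test/regression_test.json ../regroup-split-server/server/regroup_split/test/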
Run unit tests and check for errors.
ModuleNotFoundError: No module named 'MeCab'
$ pip install mecab
Don't do this; see mecab on heroku.
$ pip install mecab-python3==0.996.5
If the unit tests pass, call the tests from server/__init__.py
Run flask run to check that the tests also work on the local development server
Error messages are easier to read on the local development server than after deployment.
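One way to call the tests from the server is to run them programmatically and capture the report as a string, which a Flask view can then return. The following is a minimal sketch using only the standard library; `run_tests_report` and `SmokeTest` are hypothetical names and not part of the original project.

```python
import io
import unittest


class SmokeTest(unittest.TestCase):
    """Hypothetical stand-in for the project's real test case."""

    def test_split_words(self):
        self.assertEqual("a b".split(), ["a", "b"])


def run_tests_report(test_case):
    """Run one TestCase class and return (success flag, text report)."""
    stream = io.StringIO()
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case)
    result = unittest.TextTestRunner(stream=stream, verbosity=2).run(suite)
    return result.wasSuccessful(), stream.getvalue()


ok, report = run_tests_report(SmokeTest)
print(report)
```

A view in server/__init__.py could return the report string, so the result is visible in the browser both locally and after deployment.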
Common Corrections
Relative imports: from .foo import bar.
I usually run the code as a script while experimenting, but the server imports it as a module, so the import behavior changes.
Maybe it's better to routinely run it with %run -m in IPython.
Path of a data file
If you're writing in a way that depends on the current directory at runtime, you'll get into trouble here.
Use DIR = os.path.dirname(__file__).
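A minimal sketch of the data file fix, assuming the data files sit in the same directory as the module that reads them. `data_path` is a hypothetical helper name.

```python
import os

# Directory that contains this module, regardless of the current directory
# at runtime (it differs between running as a script and under gunicorn).
DIR = os.path.dirname(os.path.abspath(__file__))


def data_path(filename):
    """Return the absolute path of a data file stored next to this module."""
    return os.path.join(DIR, filename)


# Example: open(data_path("HEAD_TOKENS_TO_REMOVE.txt")) instead of
# open("HEAD_TOKENS_TO_REMOVE.txt"), which depends on the current directory.
print(data_path("HEAD_TOKENS_TO_REMOVE.txt"))
```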
Push to heroku when it works locally
$ pip freeze > requirements.txt
Don't forget to git add and git commit!
I guess I should have done this at the time I installed the package.
$ git push
If the build fails, see mecab on heroku.
After a successful build, heroku open shows a 500 error
View runtime logs
$ heroku logs --tail
TypeError: 'dict_keys' object is not reversible
Python on heroku is 3.6 by default
By default, newly created Python apps use the python-3.6.12 runtime. --- Heroku Python Support | Heroku Dev Center
Match the runtime version on Heroku to the local one
$ echo python-3.8.7 > runtime.txt
Test cases now work on heroku as well.
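The error above comes from calling reversed() on a dict view, which Python supports only from 3.8. If pinning the runtime were not an option, converting the keys to a list first would be a portable alternative. A small sketch with made-up data:

```python
import sys

d = {"a": 1, "b": 2, "c": 3}  # made-up example data

# Portable: materialize the keys as a list before reversing.
portable = list(reversed(list(d.keys())))
print(portable)

# Reversing a dict view directly is only supported from Python 3.8.
if sys.version_info >= (3, 8):
    direct = list(reversed(d.keys()))
    print(direct == portable)
```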
Until now the experimental scripts were run in the terminal, with results observed on standard output. Add an interface to them that takes a value passed from the server and returns the processed result.
In this case, it takes a string and returns a list of token strings.
At this point you need to decide whether to return a rich object or something that can be serialized as JSON.
That in itself depends on the application.
Converting to something JSON-serializable is needed everywhere, so I think it is good to have it on the library side, especially since the proper serialization can change as the internal structures change.
code:python
def process_single_line(line):
    tokens = tokenize(line)
    calc_split_priority(tokens)
    return dict(
        tokens=concat_tokens(tokens, " "),
        # a list (not a generator) so that the result is JSON-serializable
        split=[concat_tokens(ts) for ts in split(tokens)])
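The idea of keeping the JSON conversion on the library side could look like the sketch below. The `Token` class, its fields, and `tokens_to_json` are hypothetical; the real project's token objects may be structured differently.

```python
import json
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical rich token; the real project's token class may differ."""
    surface: str
    split_priority: float = 0.0

    def to_json(self):
        """Return a plain dict that json.dumps can serialize."""
        return {"surface": self.surface, "split_priority": self.split_priority}


def tokens_to_json(tokens):
    """Convert a list of rich tokens into a JSON-serializable list."""
    return [t.to_json() for t in tokens]


tokens = [Token("sticky", 0.2), Token("notes", 0.9)]
print(json.dumps(tokens_to_json(tokens)))
```

With this layout, a change to the internal structure only requires updating `to_json` in the library, and the server code stays the same.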
GET
code:python
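from flask import request

@app.route('/api/', methods=['GET'])
def api():
    text = request.args["q"]  # query parameter: /api/?q=...
    ret = regroup_split.process_single_line(text)
    return ret  # Flask serializes a returned dict to JSON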
Check that it works by sending a GET request to /api/?q=...
The returned dict is automatically serialized to JSON
POST
code:python
@app.route('/api/', methods=['GET', 'POST'])
def api():
    if request.method == "GET":
        text = request.args["q"]
    else:
        text = request.json["q"]  # JSON request body
    ret = regroup_split.process_single_line(text)
    return ret
$ curl -X POST -H "Content-Type: application/json" -d '{"q":"test"}' localhost:5000/api/
Check that it works
Then git push and make sure it works on Heroku as well
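Besides curl, the same check can be scripted with Flask's built-in test client, which needs no running server. This is a self-contained sketch; `process_single_line` here is a stub standing in for regroup_split.process_single_line, and its output is made up.

```python
from flask import Flask, request

app = Flask(__name__)


def process_single_line(line):
    """Stub standing in for regroup_split.process_single_line (assumption)."""
    words = line.split()
    return dict(tokens=" ".join(words), split=words)


@app.route("/api/", methods=["GET", "POST"])
def api():
    if request.method == "GET":
        text = request.args["q"]
    else:
        text = request.json["q"]
    return process_single_line(text)


# Exercise both methods through the test client instead of a live server.
client = app.test_client()
get_result = client.get("/api/", query_string={"q": "hello world"}).get_json()
post_result = client.post("/api/", json={"q": "hello world"}).get_json()
print(get_result == post_result)
```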
Create a client side that calls this API
code:python
import requests
import json
API_URL = "https://regroup-split-server.herokuapp.com/api/"
sample_text = "Ah, so people who are not used to the process of making lots of stickies and doing the KJ method don't have a good idea of how granular the information should be at the point of making the stickies in the first place. That's where the software needs to help."
payload = {"q": sample_text}
r = requests.post(API_URL, json=payload)
assert r.ok
for s in r.json()["split"]:
    print(s)
"""
Expected output:
Make lots of sticky notes.
People unfamiliar with the process of doing the KJ method.
How granular is the information at the point where you make a sticky note?
I can't pinpoint a good one.
Software needs to support
"""
Call from JS
Flask-CORS
Done ✅ The longest line is split with one click.
---
Flask to HTTPS
---
This page is auto-translated from /nishio/Herokuで自然言語処理 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.